Visualizing a Billion Rows of Data

Apache Arrow Visualizing

A post about using apache arrow and efficiently visualizing in R

Duncan Gates
2023-03-03

The NYC Taxi Data

Almost everyone has heard of the NYC taxi data at this point, in its current form it is about 24 columns and just under 1.7 billion rows. Each row represents a ride that occurred somewhere between 2009 and 2022. The important columns in this visualization are

Libraries

There are the libraries used for tranforming and visualizing.

Loading the Data

This transfers the nearly 70 gigabytes of taxi data to my computer.

copy_files(
  from = s3_bucket("ursa-labs-taxi-data-v2"),
  to = "~/Downloads/nyc-taxi"
)

The datasets can subsequently be opened as such.

nyc_taxi_tiny <- open_dataset("~/Downloads/nyc-taxi-tiny/")
nyc_taxi <- open_dataset("~/Downloads/nyc-taxi/")

Checking it out

glimpse(nyc_taxi_tiny)

Plotting a million rows

tic()
nyc_pickups <- nyc_taxi_tiny |>
  select(pickup_longitude, pickup_latitude) |>
  filter(
    !is.na(pickup_longitude),
    !is.na(pickup_latitude)
  ) |>
  collect()
toc()

Let’s check out nyc_pickups

glimpse(nyc_pickups)

This goes pretty quick now

x0 <- -74.05 # minimum longitude to plot
y0 <- 40.6   # minimum latitude to plot
span <- 0.3  # size of the lat/long window to plot

tic()
pic <- ggplot(nyc_pickups) +
  geom_point(
    aes(pickup_longitude, pickup_latitude), 
    size = .2, 
    stroke = 0, 
    colour = "#800020"
  ) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_void() +
  coord_equal(
    xlim = x0 + c(0, span), 
    ylim = y0 + c(0, span)
  )
pic
toc()

Plotting a billion rows

tic()
pixels <- 4000
pickup <- nyc_taxi |>
  dplyr::filter(
    !is.na(pickup_longitude),
    !is.na(pickup_latitude),
    pickup_longitude > x0,
    pickup_longitude < x0 + span,
    pickup_latitude > y0,
    pickup_latitude < y0 + span
  ) |>
  dplyr::mutate(
    unit_scaled_x = (pickup_longitude - x0) / span,
    unit_scaled_y = (pickup_latitude - y0) / span,
    x = as.integer(round(pixels * unit_scaled_x)), 
    y = as.integer(round(pixels * unit_scaled_y))
  ) |>
  dplyr::group_by(x, y) |>
  dplyr::summarise(pickup = n()) |>
  # dplyr::count(x, y, name = "pickup") |>
  dplyr::collect()
toc()

My laptop solves this in 27.51 seconds, again lets take a look at the resulting dataframe.

glimpse(pickup)
tic()
grid <- expand_grid(x = 1:pixels, y = 1:pixels) |>
  left_join(pickup, by = c("x", "y")) |>
  mutate(pickup = replace_na(pickup,  0))
toc()
tic()
pickup_grid <- matrix(
  data = grid$pickup,
  nrow = pixels,
  ncol = pixels
)
toc()
render_image <- function(mat, cols = c("white", "#800020")) {
  op <- par(mar = c(0, 0, 0, 0))
  shades <- colorRampPalette(cols)
  image(
    z = log10(t(mat + 1)),
    axes = FALSE,
    asp = 1,
    col = shades(1000),
    useRaster = TRUE
  )
  par(op)
}

Rendering and saving

tic()
png(file = here::here("imgs/taxi_viz_billion_rows.png"),
    bg = "#27233a",
    width = 4000,
    height = 4000)
render_image(pickup_grid, cols = c("#27233a", "white", "#F46036"))
dev.off()
toc()
knitr::include_graphics(here::here("imgs/taxi_viz_billion_rows.png"))

Citation

For attribution, please cite this work as

Gates (2023, March 3). The Data Dive: Visualizing a Billion Rows of Data. Retrieved from https://duncangates.me/blog/site/posts/2023-03-02-visualizing-a-billion-rows-of-data-in-r-with-apache-arrow/

BibTeX citation

@misc{gates2023visualizing,
  author = {Gates, Duncan},
  title = {The Data Dive: Visualizing a Billion Rows of Data},
  url = {https://duncangates.me/blog/site/posts/2023-03-02-visualizing-a-billion-rows-of-data-in-r-with-apache-arrow/},
  year = {2023}
}